Introduction

My goal with this project was to attempt to answer the age old sports question of “So, Who is going to win this game?”

I wanted to see if I could use historic information about a teams offensive and defensive performances to garner an accurate prediction on whether or not they will win there upcoming game.

All data ingestion, processing, analysis, and modeling was done in R and I leaned heavily on the tidyverse and tidymodels ecosystems to get this done.


Data Retrieval

I pulled NFL play-by-play data together using the nflFastR R package. I then did some data cleaning and wrangling tasks to get the data into a usable format.

I started by getting the full play-by-play datasets from 1999-2021. Each year there are some ~50,000 plays run, and for each play, there are 372 columns of data giving information about what happened on that play.

I created a handful of helper functions that take the nflFast R play-by-play data as an input and then aggregates the data to game level data.

In the end I used data from the 1999-2021 seasons, however I completed omitted the entire 2021 season to use as a holdout set of data to test our models against.



Exploritory Data Analysis

Let’s dig into some of the data and see what we can learn.


Season win totals

Below is a box plot showing the season win totals for each NFL team over the period of record, arranged by mean annual win total. It is not surprising to see the New England Patriots leading the league with an average of 12.6 wins per season. The Patriots are followed by the Steelers, Packers, Colts, Ravens, all teams that have seen consistent success over the last 23 years. At the bottom of the league are the Browns, Lions, Jaguars, and the Raiders.



QB performances

Now how about which team’s quarterbacks have the highest EPA. EPA is a metric that determines how likely a team is to score points as a result of a play. If a QB does something good (long completion, run for a first down), then there team is more likely to score points on the ensuing play, thus Expected points were added (EPA is positive). On the other hand if a QB does something bad (incompletion, takes a big sack), then there team is less likely to score points on the ensuing play and expected points were removed (EPA is negative)

Below is a plot showing the for each team, the average QB EPA across the all seasons. It is not suprising to see teams like the Patriots, Packers, and Colts at the top of the league in terms of average QB EPA seeing as though these teams had Hall of Fame Quarterbacks for most of the period of time, Tom Brady, Brett Favre/Aaron Rodgers, and Peyton Manning, respectively. This plot matches very closely with the above season win totals boxplot. The teams with the best quarterbacks tend to have the most total wins at the end of a season.




Feature selection

I decided I wanted to capture 4 key domains of features that may influence a teams likelihood to win an upcoming game:

  • Elo Rating
  • Offense
  • Defense
  • Off field factors

Note: While exploring the data and creating the features I wanted to use in the model, I created a host of utility helper functions that work off of the play-by-play output data from the nflFastR package. Those helper functions can be found here.


Elo Rating

I created an Elo rating system for each season to create a metric that keeps track of a teams rank relative to the rest of the league. Elo rating systems were first created to rate chess players and are now commonly used in many sports such as American Football, baseball, basketball, etc. Special thanks to the creators of the elo package, your package made my life a lot easier. Here is more information on Elo Rating Systems and its inventor Arpad Elo.


Offense

I calculated offensive and defensive statistics for every game a team played in the NFL from 1999 - 2021.

Variable Type Description
Score Differential Numeric Average score differential throughtout the game
Third Down Conversion Rate Numeric Percent of 3rd downs converted by offense
Turnovers Numeric Total Turnovers by the offense (Fumbles lost + Interceptions
QB EPA Numeric Average Quarterback EPA across all plays in the game
Scoring Drive percentage Numeric Number of drives that results in a score (touchdown/fieldgoal)


Defense

I replicated many of the same offensive statistics for each team’s defense to represent how opposing offenses performed against there defense.

Variable Type Model Variable Description
Opponent Points Scored in quarter 1, 2, 3, 4 Numeric def_qtr_pts_1,2,3,4 Total Points allowed by defense at the end of each quarter
Opponent Third Down Conversion Rate Numeric def_third_down_pct Percent of 3rd downs converted by opposing offenses
Opponent Turnovers Numeric def_turnovers Total Turnovers created by defense (Fumbles lost + Interceptions
Opponent QB EPA Numeric def_qb_epa Average Quarterback EPA allowed by defense across all plays in the game
Opponent Scoring Drive percentage Numeric def_score_drives_pct Percent of Drives against the defense that resulted in points


Off field factors

I included a handful of other information known prior to the game such as if the game is a home game, a division game, the number of days of rest between games, and the team’s overall, home, and away winning percentages, and the team’s ranking via an Elo rating.

Variable Type Model Variable Description
Home Binary home Indicating if the team is the home team (1 = home, 0 = away
Division Game Binary div_game Indicating if it is a division game (1 = division game, 0 = not division game
Rest days Numeric rest_days Numeric variable for the amount of rest the team had between games
Win % Numeric win_pct Percent of all games won
Home Win % Numeric home_win_pct Percent of home games won
Away Win % Numeric away_win_pct Percent of away games won
Elo rating Numeric elo Elo rating per week in each season, relative to rest of the league

Once all of the above variables were wrangled, cleaned, and aggregated from the NFL play-by-play data, I was left with a data frame that had 1 row for every game a team played during the season, with a column for each of the aforementioned derived variables.

Corrolation

Let’s take a look at how our game summary data relates to wins.


Below is a correlation matrix that highlights the relationship between variables in our data. We are most interested in the bottom row in this plot which indicates the correlation between a team winning there game and the other variables in our data set. Darker green colors indicate positive correlations while darker red colors indicate negative correlations.

We see that Elo rating has a relatively strong positive correlation with winning, which makes alot of sense, a higher Elo rating (higher ranked team) is more likely to end up with a win. The average score differential for a team also makes good sense, if a team on average has a higher score differential (they score more points than are scored against them) then it’s logical that they would end up winning more games. The turnover variable has a strong negative correlation with winning, so if a team on average has fewer turnovers, they are more likely to end up winning the game.


At this point in my analysis, I was working with a posteriori data, so data that was collected as a product of what happened on the field that week. If I want to make any useful predictions about the coming week, I would need to conjur up some a priori data.


Cumulative Averages

To capture how well a team is doing throughout each season, I created lagged cumulative means for all of the above variables. Such that, for each week a team plays a game, the cumulative mean of all the preceeding weeks of data are calculated for every variable leading into the upcoming slate of games.

Below is the general method I used to get these lagged Cumulative averages values. Note, for some variables like winning percentage, the lagged cumulative averages was not calculated and rather only the lagged value was determined. In the example code below, the lagged winning percentage, and the lagged cumulative average QB EPA is calculated for the Arizona Cardinals 2014 season.

nfl %>% 
  dplyr::filter(season == 2014, team == "ARI") %>% 
  dplyr::select(season, week, team, win, win_pct, qb_epa) %>% 
  dplyr::group_by(season, team) %>% 
  dplyr::arrange(season, week, .by_group = T) %>% 
  dplyr::mutate(
    across(c(win_pct), ~dplyr::lag(.x), .names = "{col}_lag"),
    across(c(qb_epa),  ~dplyr::lag(dplyr::cummean(.x)), .names = "{col}_lag")
    ) %>% 
  dplyr::mutate(across(c(win_pct:qb_epa_lag), round, 2)) %>% 
  dplyr::ungroup() %>% 
  dplyr::relocate(season, week, team, win, win_pct, win_pct_lag, qb_epa, qb_epa_lag)
## # A tibble: 17 × 8
##    season  week team    win win_pct win_pct_lag qb_epa qb_epa_lag
##     <dbl> <dbl> <chr> <dbl>   <dbl>       <dbl>  <dbl>      <dbl>
##  1   2014     1 ARI       1    1          NA     -0.1       NA   
##  2   2014     2 ARI       1    1           1      0.03      -0.1 
##  3   2014     3 ARI       1    1           1      0.2       -0.04
##  4   2014     5 ARI       0    0.75        1     -0.16       0.04
##  5   2014     6 ARI       1    0.8         0.75  -0.01      -0.01
##  6   2014     7 ARI       1    0.83        0.8    0.09      -0.01
##  7   2014     8 ARI       1    0.86        0.83  -0.03       0.01
##  8   2014     9 ARI       1    0.88        0.86  -0.02       0   
##  9   2014    10 ARI       1    0.89        0.88  -0.11       0   
## 10   2014    11 ARI       1    0.9         0.89  -0.06      -0.01
## 11   2014    12 ARI       0    0.82        0.9   -0.34      -0.02
## 12   2014    13 ARI       0    0.75        0.82  -0.09      -0.05
## 13   2014    14 ARI       1    0.77        0.75   0         -0.05
## 14   2014    15 ARI       1    0.79        0.77  -0.14      -0.05
## 15   2014    16 ARI       0    0.73        0.79  -0.25      -0.05
## 16   2014    17 ARI       0    0.69        0.73   0.09      -0.07
## 17   2014    18 ARI       0    0.65        0.69  -0.4       -0.06

If you take a look at ARI’s week 5 game (row 5), the win_pct_lag column for week 5 is represented by the teams win_pct from the previous week, week 4 in this case, or in other words, the team’s winning percentage leading into the week 5 match up. On the other hand, the qb_epa_lag column was calculated using the mean of qb_epa from weeks 1-4.


I did this lagging process across all of my variables such that for each row with a game outcome (win/loss), that team has variables representing their performances up until that point in the season.


Now our data is set up so that all the information used to inform our prediction is data that would be available to us prior to the upcoming week of games. This is important because information such as the number of turnovers during the game is not information we would have prior to the game, and thus can’t be used as a predictor in our model, we can only use historic data to inform our predictions.

Home team

Initially I had been running models with 2 data points for each game, one being the home team perspective and the other being the away team perspective. I pretty quickly ran into some major overfitting in my models and realized it was a better idea to only use data from the home team.



Modeling

I decided I wanted to run my data across a panel of 6 different models and see how they all perform against each other

Data Budget/Splits

The first step was to split up our data into testing and training splits (75% training, 25% testing). When I split my data, I chose to stratify the data by the binary win/loss value to ensure there was the same proportion of wins and losses in our training and testing data splits.

# Set random seed
set.seed(234)

# Partition training/testing data, stratified by win/loss
nfl_split <- rsample::initial_split(nfl_df, strata = win)

# training data split - 75%
nfl_train <- rsample::training(nfl_split)

# testinng data split - 25%
nfl_test  <- rsample::testing(nfl_split)

Data Preprocessing

Using the recipes package, I made ID variables for information about the game and when it happened. I then applied some recipe steps to my training data. Because I decided to only use the home teams data, this meant that there was going to be more wins than losses in my dataset because the home team tends to win more often then the away team (56% wins / 44% losses). To account for the slight imbalance of wins to losses in the data, I chose to use the themis package to upsample my data using the themis::step_smote() function.


Recipe steps

  • step_zv is applied to all predictors to remove all variables with only one variable (zero-variance)
  • step_normalize is used to normalize all the numeric predictors in the training data so that they all have a standard deviation of 1 and a mean of 0
  • step_smote is applied to our dependent win variable to fix the class imbalance due to having slightly more wins than losses in the data. The step_smote function works by upsampling the minority class and generating new, synthetic data by using the nearest-neighbors of the minority class instances.

Below I create the two recipe steps that we will apply to our training data before modeling.

# Base recipe
base_recipe <- 
  recipes::recipe(
  formula = win ~ ., 
  data    = nfl_train
  ) %>% 
  recipes::update_role(
    game_id, team, opponent, season, week, new_role = "ID"
  ) 

# Normalize, SMOTE algo upsampling recipe
norm_smote_recipe <- 
  base_recipe %>% 
  recipes::step_zv(recipes::all_predictors()) %>% 
  recipes::step_normalize(recipes::all_numeric_predictors())
  themis::step_smote(win, over_ratio = 0.9,  skip = T)
  
# zero variance SMOTE algo upsampling recipe
zv_smote_recipe <- 
  base_recipe %>% 
  recipes::step_zv(all_predictors()) %>% 
  themis::step_smote(win, over_ratio = 0.9,  skip = T)


Cross-validation folds

I created 10 cross-validation folds from the training data. As I did with the initial training/testing split, I made sure to create stratified resamples using the win/loss variable to ensure the same proportion of wins and losses appear in CV folds as the do in the original data.

# Set random seed 
set.seed(432)

# Cross-validation folds
nfl_folds <- rsample::vfold_cv(nfl_train, v = 10, strata = win)


workflowsets

Using the workflowsets I created a workflow_set object containing each of the data preprocessing recipes and corresponding model specifications.

# Workflow set of candidate models
nfl_wfs <-
  workflowsets::workflow_set(
    preproc = list(
      kknn_rec        = norm_smote_recipe,
      glmnet_rec      = norm_smote_recipe,
      xgboost_rec     = zv_smote_recipe,
      nnet_rec        = norm_smote_recipe,
      svm_poly_rec    = norm_smote_recipe,
      svm_rbf_rec     = norm_smote_recipe
    ),
    models  = list(
      knn            = knn_spec,
      glmnet         = glmnet_spec,
      xgboost        = xgboost_spec,
      nnet           = nnet_spec,
      svm_poly       = svm_poly_spec,
      svm_rbf        = svm_spec
    ),
    cross = F
  )


Tuning Hyperparameters

The next step was to tune our hyperparameters for each of models. Using the tune package I applied the tune::tune_grid() function across my workflowset object containing my model recipes and specifications. 20 random candidate parameters were created using the dials::grid_latin_hypercube() function from the dials package

Parallel processing

This step is the most time consuming and resource intensive process of a machine learning workflow. So, to speed up the process I run the tuning process across multiple cores on my computer, making use of my computer’s multiple processors.

Racing methods

To get another boost in processing time, I made use of the tune_race_anova from the finetune package, which makes use of a repeated measure ANOVA model to iterativily eliminates tuning parameters that are unlikely to yield the best results. Combining parrallel processing and racing methods can improve processing times by ~35 fold.

# Choose metrics
my_metrics <- yardstick::metric_set(roc_auc, pr_auc, accuracy, mn_log_loss)

# Set up parallelization, using computer's other cores
parallel::detectCores(logical = FALSE)
modeltime::parallel_start(6, .method = "parallel")

# Set Random seed
set.seed(589)

# Tune models in workflowset
nfl_wfs <-
  nfl_wfs %>%
  workflowsets::workflow_map(
    "tune_grid",
    resamples = nfl_folds ,
    grid      = 20,
    metrics   = my_metrics,
    control   = tune::control_grid(
      verbose   = TRUE,
      save_pred = TRUE),
    verbose   = TRUE
  )

  # Stop parrallelization
modeltime::parallel_stop()




Model Evaluation

Model comparisons

Now we can compare how all of our models did compared to one another and make a decision as to which one we want to use to make predictions. The plots below shows how the 6 models performed on the set of resampling data.

If you take a look at the ROC AUC and mean log loss plots the best performing models are the Logistic Regression (logistic_reg), the Support Vector Machines (svm_poly/svm_rbf), the Multilayer Perceptron (mlp) and the Gradient Boosted Decision Trees (boost_trees).

I decided to set aside the Multilayer Perceptron model due to the slight increase in mean log loss relative to the other top performing models.

In particular, the boosted trees looks to perform at approximately the same level asthe mlp model when it comes from correctly predicting wins (sensitivity) and correctly predicting losses (specificity), without the increase in log loss.


A standard threshold for log loss when it comes to binary predictions is ~0.69 because anything higher than that and you aren’t beating the probability of a 50-50 guess. The table below shows the mean log loss for each model. The best log loss you can have is 0 and the larger the number gets the worse the model is at predicting the correct outcome.

The table belows shows the performance metrics for how each model.

resamp_metrics %>% 
  dplyr::arrange(mean_log_loss) %>% 
  ggplot() +
  geom_point(aes(x = reorder(model, mean_log_loss), y = mean_log_loss), size = 2) +
  geom_hline(yintercept = 0.69, size = 1.5) 


Variable Importance

Looking at at variable importance plots, generated from the vip package, can give us an idea of which variables are the most important to ours models.

Below is the variable importance scores for the Logistic Regression and the XGBoost models.


The most important variables in both models looks to be the team’s Elo ratings, the cumulative average score differential and the opponent’s winning percentage.



Model Performance on the Test Data

I then selected the tuning parameters for each model that had the best performance metrics on the data resamples. Using these model parameters, I refit each model for the last time using just the initial data split, fitting one last time on the training data and evaluating on the testing data. I then used the final fitted models to make predictions. Let’s check out the results!


And in table form…


The nearest neighbor model sees a big drop in mean log loss between the training and testing data, besides this, the remaining models appear to perform similarly with both the training and testing data, which gives us some more confidence that these models would perform similarly in the real world.



ROC Curves

Receiver Operator Characteristic curves, (aka ROC curves) are a way of understanding how well a model is predicting the True Positive events vs. the False Positive events.

In our scenario, with models that are trying to predict if an NFL team will win, a True Positive event would be that the model predicts a win and the team actually wins, while in the False Positive event the model predicts a win, and the team actually loses.

The True Positive Rate (TPR) is on the Y axis and the False Positive Rate (FPR) is on the X axis.

The TPR, or sensitivity, along the Y axis indicates the probability that the model predicted a win and the team actually won. While FPR or, 1 - specificity, along the X axis indicates the probability that the model predicted a win and the team actually lost.

The closer the ROC curve is to the top left corner of the plot the better the model is doing at correctly classifying the data. A perfect ROC Curve would be a 90 degree angle along the top left corner of the plot and would indicate that the model is able to perfectly classify the data correctly.

roc_curve <-
  roc_df %>%
  ggplot2::ggplot() +
  ggplot2::geom_abline(
    lty   = 2, 
    alpha = 0.7,
    color = "black", 
    size  = 1)  +
  ggplot2::geom_line(
    ggplot2::aes(
      x     = 1 - specificity, 
      y     = sensitivity,
      color = model), 
    alpha = 0.6, 
    size  = 1
    ) +
  # ggplot2::scale_color_manual(values = RColorBrewer::brewer.pal(6, "Dark2")) +
  ggplot2::scale_color_manual(values = c( "red3", "forestgreen", "dodgerblue3", "coral3", "cyan3", "darkorange")) +
  ggplot2::coord_equal() +
  ggplot2::labs(
      title    = "ROC Curves",
      subtitle = "Test data",
      x        = "False Positive Rate (1 - Specificity)",
      y        = "True Positive Rate (Sensitivity)"
    ) + 
  apatheme +
  ggplot2::theme( 
    axis.title.x        = ggplot2::element_text(hjust=0.5,  vjust = -2),
    axis.title.y        = ggplot2::element_text(hjust=0.5, vjust = 2)
    )


# ggsave(
#     here::here("img", "roc_curve.png"),
#     roc_curve,
#     width  = 12,
#     height = 10
#   )


No surprises here, using the test data all of the models are performing similarly, except for the K-nearest neighbors model, which is much worse at classifying the data then the rest of the models. This ROC curve just visually confirms what we saw in the model metric summary table. Based on these metrics, I am going to drop the K nearest neighbors model for now and focus on the rest of the models


Confusion Matrices

A confusion matrix is a helpful way of understanding where a classification model is correct or incorrect in it’s predictions. At first glance, I thought the name confusion matrix came from the fact that I was confused. Turns out the name has more meaning, its purpose is to highlight where a classification model is confused in its predictions…not when the modeler is confused…


So, if this is your first time seeing a confusion matrix this illustration might help.

Insert hot dog confusion matrix image


A few notes from looking at this confusion matrix: - All three of my models do a better job of correctly predicting losses in the test data than Las Vegas did in the same games

  • The Support Vector Machine has more total correct predictions than both other models

  • The SVM RBF Model is the highest sensitivity model, it correctly predict wins better than the all other models as well as Las Vegas’s predictions

  • The Logistic Reg. Model predicts losses better than the SVM and XGBoost models

  • ~35% of correct predictions are losses, the remaining ~65% are correctly predicted wins (~43% of the test data were actual losses so this isn’t too far off)

  • For this data/domain, having the model predict more wins than losses makes logical sense, the home team tends to win more often than they lose

  • In general, the models have the most trouble classifying FP


And here are the summary statistics for these confusion matrices

confmat_tbl %>%
  flextable::flextable() %>% 
  flextable::add_header_row(
    values    = c("", "Metrics"),
    colwidths = c(1, 5)
    ) %>%
  flextable::theme_box() %>% 
  flextable::align(align = "center", part = "all")


Looking at the SVM model in the table above, we see a sensitivity of ~0.72, which means the SVM model correctly predicted the home team to win 72% of the time. The specificity of 0.73 indicates that 73% of home team losses were correctly predicted. Among all the models, the SVM model looks to do the best at correctly identifying wins and losses for the home team.


Prediction from the test data

If we look at the 1373 games in our test data, and generate predictions for those games, we can calculate the percent of games in each year which our model correctly predicted the outcome to.




Holdout dataset (2021 Predictions)

Now let’s apply the fitted models to the 2021 season holdout data and see how those predictions fare against Vegas’ predicted winners.

We can visualize percent of games that Las Vegas got correct in 2021 and compare it to the ML models % of correct predictions. The X axis has the percent of correct predictions and the Y axis is the week of the NFL season.


Looks like our predictions are right in line with Vegas, in some weeks we do better than Vegas are predicting the correct game outcomes and some weeks we do worse.


The table below shows the actual outcomes of the 2021 season, the favored team according to Las Vegas, and the prediction’s from my fitted models. Green highlighted cells indicate that the prediction lined up with the actual game outcome, and red highlighted cells mean that the prediction was incorrect. The table is from the home team’s persepctive (i.e. a win in the actual_outcome column means the home team won).

First we’ll can check out one of the weeks that our models outperformed Vegas.

Good week of predictions

All of our models did better than Vegas in week 12,

# A week with more preditions correct than Las Vegas
good_week <- 
  pred_table %>% 
  dplyr::filter(week == 12)

# Color code TP and TN as green and FP and FN as red
good_week_color <- ifelse(good_week[, c(5, 6, 7, 8, 9)] == good_week[, c(4, 4, 4, 4, 4)], "#85E088", "indianred2")

# Flextable with color coding
good_week %>% 
  flextable::flextable() %>% 
  flextable::bg(j = c(5:9), bg = good_week_color) %>% 
  flextable::add_header_row(
    values    = c(" ", "Vegas", "Models"), # values = c(" ","Vegas Predictions", "Model Predictions"),
    colwidths = c(4, 1, 4)
    # values    = c(" ", "Vegas / Model Predictions"), colwidths = c(4, 4)
    ) %>% 
  flextable::theme_box() %>% 
  flextable::align(align = "center", part = "all")



And now a bad week…

Bad week of predictions

# A week with more preditions correct than Las Vegas
bad_week <- 
  pred_table %>% 
  dplyr::filter(week == 15)

# Color code TP and TN as green and FP and FN as red
bad_week_color <- ifelse(bad_week[, c(5, 6, 7, 8, 9)] == bad_week[, c(4, 4, 4, 4, 4)], "#85E088", "indianred2")

# Flextable with color coding
bad_week %>% 
  flextable::flextable() %>% 
  flextable::bg(j = c(5:9), bg = bad_week_color) %>% 
  flextable::add_header_row(
    values    = c(" ", "Vegas", "Models"), 
    colwidths = c(4, 1, 4)
    ) %>% 
  flextable::theme_box() %>% 
  flextable::align(align = "center", part = "all")




Conclusion

We did it, we implemented NFL data into a Machine learning workflow and generated some reasonably accurate predictions!

To recap what I did: - Started with raw NFL play-by-play data - Cleaned and processed the data into a model-ready format - Selected a handful of ML models, - trained and tested the models with our prepped data, - Split our data into training and testing datasets - Trained our models using the training data and evaulated them using 10 fold cross validation. - Determined optimal hyperparameters and used these to refit our models on the entire dataset - Generated predictions and compared our predictions to the predicted favorites according to Las Vegas

I’m not a huge sports bettor, but I really just wanted to make a model that could potentially make me money if I was.